{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center> RegEx in Python</center>\n",
    "\n",
    "![](images/memes/meme1.jpg)\n",
    "\n",
    "\n",
    "> ### *A regular expression is a sequence of characters that define a search pattern.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. What exactly is a Regular Expression?\n",
    "\n",
    "A regular expression, often called a pattern, is **an expression used to specify a set of strings** required for a particular purpose. \n",
    "\n",
    "- A simple way to specify a finite set of strings is to list its elements or members. <br>For example `{file, file1, file2}`. \n",
    "    \n",
    "\n",
    "- However, there are often more concise ways to specify the desired set of strings. <br>For example, the set `{file, file1, file2}` can be specified by the pattern `file(1|2)?`. <br>We say that this pattern matches each of the three strings. [Wanna check?](https://regexr.com/48om5)\n",
    "\n",
    "> In most formalisms, if there exists at least one regular expression that matches a particular set then there exists an infinite number of other regular expressions that also match it, i.e. **the specification is not unique**.<br>\n",
    "For example, the string set `{file, file1, file2}` can also be specified by the pattern `file\\d?`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## 2. The math of Regular Expressions\n",
    "\n",
    "- The concept of **Regular Expressions** originated from **[Regular Languages](https://en.wikipedia.org/wiki/Regular_language)**. \n",
    "\n",
    "- **Regular Expressions** describe **Regular Languages** in **[Formal Language Theory](https://en.wikipedia.org/wiki/Formal_language)**.\n",
    "\n",
    "> ***Formal Language Theory***: In mathematics, computer science, and linguistics, a **formal language** consists of words whose letters are taken from an alphabet and are **well-formed according to a specific set of rules**. The field of formal language theory studies primarily the purely syntactical aspects of such languages—that is, their internal structural patterns.\n",
    "\n",
    "> ***Regular Languages***: A regular language is a category of **formal languages** which can be expressed using a regular expression. ![](images/formal-lang-theory.png)\n",
    "\n",
    "- **Note**: Today, many regular expressions engines provided by modern programming languages are augmented with features that allow recognition of languages that <span style=\"color:red;\">**cannot**</span> be expressed by a classic regular expression!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Uses of Regular Expressions\n",
    "\n",
    "Some important usages of regular expressions are:\n",
    "\n",
    "- Check if an input honors a given pattern; for example, we can check whether a value entered in a HTML formulary is a valid e-mail address\n",
    "\n",
    "\n",
    "- Look for a pattern appearance in a piece of text; for example, check if either the word \"color\" or the word \"colour\" appears in a document with just **one scan**\n",
    "\n",
    "\n",
    "- Extract specific portions of a text; for example, extract the postal code of an address\n",
    "\n",
    "\n",
    "- Replace portions of text; for example, change any appearance of \"color\" or \"colour\" with \"red\"\n",
    "\n",
    "\n",
    "- Split a larger text into smaller pieces, for example, splitting a text by any appearance of the dot, comma, or newline characters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. A brief history of Regular Expressions\n",
    "\n",
    "> *The story begins with a neuroscientist and a logician who together tried to understand how the human brain could produce complex patterns using simple cells that are bound together.*\n",
    "\n",
    "- In 1943, neurophysiologists ***Warren McCulloch*** and ***Walter Pitts*** published ***\"A logical calculus of the ideas immanent in nervous activity\"***. This paper not only represented the beginning of the regular expressions, but also proposed the first mathematical model of a neural network.\n",
    "\n",
    "\n",
    "- In 1956, ***Stephen Kleene*** wrote the paper ***\"Representation of events in nerve nets and finite automata\"***, where he coined the terms **regular sets** and **regular expressions** and presented a simple algebra.\n",
    "\n",
    "\n",
    "- In 1968, the Unix pioneer ***Ken Thompson***  took Kleene's work and extended it, publishing his studies in the paper ***\"Regular Expression Search Algorithm\"***. Ken Thompson's work didn't end in just writing a paper. He also implemented Kleene’s notation in the editor ***QED***. The aim was that the user could do advanced pattern matching in text files. The same feature appeared later on in the editor ***ed***.\n",
    "\n",
    "> To search for a Regular Expression in ed you wrote `g/<regular expression>/p` The letter g meant global search and p meant print the result. The command — `g/re/p` — resulted in the standalone program grep, released in the fourth edition of Unix 1973.<br><span style=\"color:red;\">However, **grep** didn’t have a complete implementation of regular expressions.</span>\n",
    "\n",
    "- In 1979, ***Alfred Aho*** developed ***egrep (extended grep)*** in the seventh edition of Unix. The program egrep translated any regular expressions to a corresponding [DFA](https://en.wikipedia.org/wiki/Deterministic_finite_automaton).\n",
    "\n",
    "\n",
    "- In 1987, ***Larry Wall*** created the scripting language ***Perl***. Regular Expressions are seamlessly integrated in Perl, even with its own literals. Hence, Perl pushed the regular expressions to the mainstream. The implementation in Perl went forward and added many modifications to the original regular expression syntax, creating the so-called ***Perl flavor***.\n",
    "\n",
    "### Some other worth mentioning milestones\n",
    "\n",
    "- The IEEE thought their POSIX standard has tried to standardize and give better Unicode support to the regular expression syntax and behaviors. This is called the ***POSIX flavor*** of the regular expressions.\n",
    "\n",
    "\n",
    "- In late 1980s, ***Henry Spencer*** wrote ***\"regex\"***, a widely used software library for regular expressions in C programming langauge.\n",
    "\n",
    "\n",
    "### Here is a brief timeline to summarize...\n",
    "\n",
    "![](images/history.png)\n",
    "\n",
    "### Regex today\n",
    "\n",
    "- It was the rise of the web that gave a big boost to the Perl implementation of regex, and that's where we get the modern syntax of regular expressions today; it really comes from Perl. `Apache`, `C`, `C++`, `the .NET languages`, `Java`, `JavaScript`, `MySQL`, `PHP`, `Python`, `Ruby` all of these are endeavoring to be Perl-compatible languages and programs. There's also a library called the `PCRE` library that stands for Perl-Compatible Regular Expression library.\n",
    "\n",
    "\n",
    "- Today, the standard Python module for regular expressions—`re`—supports only Perl-style regular expressions. There is an [effort](https://pypi.python.org/pypi/regex) to write a new regex module with better POSIX style support. This new module is intended to replace Python's `re` module implementation eventually. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Understanding the Regular Expression Syntax\n",
    "\n",
    "A regex pattern is a simple sequence of characters. The components of a regex pattern are:\n",
    "\n",
    "- **literals (ordinary characters)**: these characters carry no special meaning and are processed as it is.\n",
    "\n",
    "- **metacharacters (special characters)**: these characters carry a special meaning and processed in some special way.\n",
    "\n",
    "\n",
    "![](images/components.png)\n",
    "\n",
    "Let's start with a simple example.\n",
    "\n",
    "Consider that we have got the list of several filenames in a folder.\n",
    "\n",
    "```\n",
    "file1.xml\n",
    "file1.txt\n",
    "file2.txt\n",
    "file15.xml\n",
    "file5.docx\n",
    "file60.txt\n",
    "file5.txt\n",
    "```\n",
    "\n",
    "And we want to filter out only those filenames which follow a specific pattern, i.e.  `file<one or more digits>.txt`.\n",
    "\n",
    "> Let's try to do this on an online tool to learn, build, & test Regular Expressions (RegEx / RegExp), [RegExr](https://regexr.com).\n",
    "\n",
    "So, the regular expression we need here is:\n",
    "\n",
    "`file\\d+\\.txt`\n",
    "\n",
    "This expression can be understood as follows:\n",
    "\n",
    "- `file` is a substring of literals which are matched with the input as it is.\n",
    "\n",
    "- `\\d` is a metacharacter which instructs the software to match this position with a digit (0-9).\n",
    "\n",
    "- `+` is also a metacharacter which instructs the software to match one or more iterations of the preceeding character (`\\d` in this case)\n",
    "\n",
    "- `\\.` is a literal. `.` is a metacharacter but we want to use it as a literal in this case. Hence, we escape it using `\\` character.\n",
    "\n",
    "- `txt` is a substring of literals which are matched with the input as it is.\n",
    "\n",
    "![](images/example1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/memes/meme2.jpg)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
